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Abstract 

Affine rank minimization algorithms typically rely on calculating the gradient of a data error followed by a 
singular value decomposition at every iteration. Because these two steps are expensive, heuristic approximations 
are often used to reduce computational burden. To this end, we propose a recovery scheme that merges the two 
steps with randomized approximations, and as a result, operates on space proportional to the degrees of freedom 
in the problem. We theoretically establish the estimation guarantees of the algorithm as a function of approxi- 
mation tolerance. While the theoretical approximation requirements are overly pessimistic, we demonstrate that 
in practice the algorithm performs well on the quantum tomography recovery problem. 



1 Introduction 

In many signal processing and machine learning applications, we are given a set of observations y G K p of a rank-r 
matrix X* G ^jnxn as y = j\X.* + e via the linear operator A : R mxn — ► M p , where r <C min{m, n} and e G W is 
additive noise. As a result, we are interested in the solution of 



/(X) 

subject to rank(X) < 



minimize 

xeR mx ™ 



(1) 



where /(X) 



AX 



is the data error. While the optimization problem in ([lj is non-convex, it is possible to 
obtain robust recovery with provable guarantees via iterative greedy algorithms (SVP) |MJD10 KC12 
relaxations [RFP10, CR09 from measurements as few as p = 0(r(m + n — r)). 



or convex 



Currently, there is a great interest in designing algorithms to handle large scale versions of ([!]) and its variants. 
As a concrete example, consider quantum tomography (QT), where we need to recover low- rank density matrices 
from dimensionality reducing Pauli measurements FGLE12] , In this problem, the size of these density matrices 
grows exponentially with the number of quantum bits. Other collaborative filtering problems, such as the Net- 
flix challenge, also require huge dimensional optimization. Without careful implementations or non-conventional 
algorithmic designs, existing algorithms quickly run into time and memory bottlenecks. 

These computational difficulties typically revolve around two critical issues. First, virtually all recovery algo- 
rithms require calculating the gradient V/(X) = 2A*(A(X) — y) at an intermediate iterate X, where A* is the 
adjoint of A. When the range of A* contains dense matrices, this forces algorithms to use memory proportional to 
0(mn). Second, after the iterate is updated with the gradient, projecting onto the low-rank space requires a partial 
singular value decomposition (SVD). This is usually problematic for the initial iterations of convex algorithms, where 
they may have to perform full SVD's. In contrast, greedy algorithms KC12 fend off the complexity of full SVD's, 



since they need fixed rank projections, which can be approximated via Lanczos or randomized SVD's HMT11 . 

Algorithms that avoid these two issues do exist, such as WYZ12 RR13 LRS + 11 Laul2 , and are typically 
based on the Burer-Monteiro splitting IBM03 . The main idea in Burer-Monteiro splitting is to remove the non- 



convex rank constraint by directly embedding it into the objective: as opposed to optimizing X, splitting algorithms 



directly work with its fixed factors UV = X in an alternating fashion, where U G 



and V G 



for some 
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f > r. Unfortunately, rigorous guarantees are difficult^] The work JNS12 has shown approximation guarantees 
if A satisfies a restricted isometry property with constant #2r < K 2 /(100r) (in the noiseless case), where K = 
cri(X*)/CT r (X*), or 52r < l/(3200r 2 ) for a bound independent of k. The authors suggest that these bounds may be 
tightened, and that practical performance is better than the bound suggests. 

In this paper, we merge the gradient calculation and the singular value projection steps into one and show that 
this not only removes a huge computational burden, but suffers only a minor convergence speed drawback in practice. 



Our contribution is a natural but non-trivial fusion of the Singular Value Projection (SVP) algorithm in |MJD10 



and the approximate projection ideas in KC12 
that has been considered in 



The SVP algorithm is an iterative hard-thresholding algorithm 



MJD10 GM11 . Inexact steps in SVP have been considered as a heuristic GM11 



but have not been incorporated into an overall convergence result]^] A non-convex framework for affine rank 
minimization (including variants of the SVP algorithm) that utilizes inexact projection operations with provable 
signal approximation and convergence guarantees is proposed in KC12| . Neither MJD10,KC12 considers splitting 
techniques in the proposed schemes. 

This work, departing from MJD10 KC12 , engineers the SVP algorithm to operate like splitting algorithms 
that directly work with the factors; this added twist decreases the per iteration requirements in terms of storage 
and computational complexity. Using this new formulation, each iteration is nearly as fast as in the splitting 
method, hence removing a drawback to SVP in relation to splitting methods. Furthermore, we prove that, under 
some conditions, it is still possible to obtain perfect recovery even if the projections are inexact. In particular, 
our assumption is that the linear map A satisfies the rank restricted isometry property, and in section |5.1| we give 
an application that satisfies this assumption, allowing perfect recovery (in the noiseless case) or stable recovery 
(in the presence of noise) from measurements p -C ran. This approach has been used for convex RFP10 and 
non- convex 



MJD10 KC12 algorithms to obtain approximation guarantees. 



2 Preliminary material 

Notation: we write Vq to be an orthogonal projection onto the closed set when it exists. For shorthand we write 
V r to mean 'P{x:rank(x)<r} (which does exist by the Eckart- Young theorem). Computer routine names are typeset 
with a typewriter font. 



2.1 R-RIP 



The Rank Restricted Isometry Property (R-RIP) is a common tool used in matrix recovery RFP10 MJD10 KC12 



Definition 1 (R-RIP for linear operators on matrices RFP10 ). A linear operator A 
R-RIP with constant 8 r {A) G (0, 1) if, VX 6 R mxn with rank(X) < r, 

(1 - S r (A))\\X\\l < \\AX\\l < (l + <5 r (,4))||X|| 2 F , 



l p satisfies the 
(2) 



We write S r to mean S r (A). 



2.2 Additional convex constraints 

Consider the variant 

minimize /(X) 

xeffimx " (3) 
subject to rank(X) < r, X e C, 

for a convex set C. Our main interests are C+ = {X : X y 0} and the matrix simplex Ca = {X : X y 0, trace(X) = 
1}. In both cases the constraints are unitarily invariant and the projection onto these sets can be done by taking 



H r > y/p, then BM03 shows their method obtains a global solution, but this is impractical for large p. Moreover, it is shown 
that the explicit rank f splitting method solves a non-convex problem that has the same local minima as Q (if r = r). However, the 
non-convex problems are not equivalent (e.g. U = 0, V = is a stationary point for the splitting problem whereas X = is generally 
not a stationary point for Furthermore, recovery bounds for non-convex algorithms, as in [GK09| and the present paper, are 

statements about a sequence of iterates of the algorithm, and say nothing about the local minima. 

2 Inexact steps are often incorporated into analysis of algorithms for convex problems. Of particular note, |Laul2| allows inexact 
eigenvalue computations in a modified Frank- Wolfe algorithm that has applications to (TIJl. 
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Algorithm 1 RandomizedSVD 



Finds Q such that X ss VqX where Vq — QQ* 

Require: Function h : Z i-> XZ 
Require: Function h* : Q M> X*Q 
Require: r € N 
Require: q G N 

I = r + p 

n a n x £ standard Gaussian matrix 



9 
10 
11 
12 



Q <- QR(PF) 

for j = 1, 2, . . . , g do 

Z^-QR(h*(Q)) 

Q QR(h(Z)) 
end for 

(C7, £, <- FactoredSVD(Q, 7^ , Z) 

Let E r be the best rank r approximation of £ 

return (U,T, r ,V) 



I j Rank of output 
/ / Number of power iterations to perform 
/ / Typical value of p is 5 



// The QR algorithm to orthogonalize W 



II X-i+i = UT,V* in the appendix 
// Xi + i = WS r V* in the appendix 



Algorithm 2 FactoredSVD(C7, D, V) 

Computes the SVD UY,V* of the matrix X implicitly given by X — UDV* 
v. (U,Rjj) <- QR(f7) 
2: (V,R V )^QR{V) 
3: (u,E,v) 4- DenseSVD(i? [/ Di?^) 
4: return (U, E, V) ^ {Uu, S, Fu) 



the eigenvalue decomposition and projecting the eigenvalues. Furthermore, for these specific C, 'P{x:rank(x)<r}nc 



Vc ° Tr (this is not obvious; see BCKK13 



In general, any convex set C satisfying the above property is compatible with our algorithm, as long as X* G C. 
We overload notation to use Vc to denote both the projection of X onto the set as well as the projection of its 
eigenvalues onto the analogous set. 

2.3 Approximate singular value computations 

The standard method to compute a partial SVD is the Lanczos method. By itself it is not numerically stable and 
requires re-orthogonalization and implicit restarts. Excellent implementations are available, but it is a sequential 
algorithm that calls matrix-vector products. This makes it more difficult to parallelize, which is an issue on 
modern multi-processor computers. The matrix-vector multiplies are also slower than grouping into matrix-matrix 
multiplies since it is harder to predict memory usage and this will lead to cache misses; it also precludes the use of 
theoretically faster algorithms such as Strassen's. Theoretically, there are no known relative error bounds in norm 
(a la Theorem [T]). 

As an alternative, we turn to randomized linear algebra. On this front, we restrict ourselves to algorithms that 
require only multiplications, as opposed to sub-sampling cntries/rows/columns, as sub-sampling is not efficient for 
the application we present. The randomized approach presented in Alg orithm [T has been rediscovered many times, 



but has seen a recent resurgence of interest due to theoretical analysis [HMT11 



Theorem 1 (Average Frobenius error). Suppose X G W nxn 7 and choose a target rank r and oversampling parameter 
p > 2 where £ := r + p < min{m, n}. Calculate Q and Vq via RandomizedSVD using q = and set X = PqX 
(which is rank £). Then 



E||X-X||| <(l + e) ||X-X r " 2 



3 This formula is literally true for C+ and {X : X y 0, trace(X) < 1}. For C = {X : X y 0, trace(X) = 1} constraints, Vc 
can increase the rank, so formally we must work on a restricted subspace and then embed back in the larger space, but this poses no 
theoretical issues. 
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Algorithm 3 Efficient implementation of SVP, K. = {3 



Require: step-size fi > 0, measurements y, initial points uq € JC mxr 1 vq e /C ,ixr , do G /C r 
Require: (optional) unitarily invariant convex set C 
Require: Function A : (u, d, v) n- «A(udiag(d)i;*) 
Require: Function At : (z, w) n- A.* (z)w 
Require: Function At* : (z,w) i-> (A*(z))*w 

1: V-i «- 0, u_i «- 0, d-\ <- 

2: for i = 0, 1, . . . do 

3: Compute /8j // See text 

4: % «- <- [Vi,Vi-i] 

5-. d y <- [(1 + /3 i )d i ,-/3 i d i -i] 

6: z A(tt„, c? y , u„) y // Compute the residual 

7: Define the functions 

h : w i — y Uy diag(d y )v*w — /iAt(z, w) 

h* : w ^ v y diag(dy)u y w — /lAt* (z, to) 
8: (itj+i, RandomizedSVD(h, h*,r) or (itj+i, d»+i, Mi+i) <^ RandomizedEIG(h, h*, r) 

9: rf m <- Pc^i+i) //Optional 
10: end for 

11: return X 4— UidiV* //If desired 

where X,. is the best rank r approximation in the Frobenius norm o/X and e = — rr- 



The theorem follows from the proof of Thm. 10.5 in HM Tllj (note that Thm. 10.5 is stated in terms of 



E||X — X|| f which is not the same as y E|jX — X|| F ). The expectation is with respect to the Gaussian r.v. in 

RandomizedSVD. For the sake of our analysis, we cannot immediately truncate X to rank r since then the error bound 
in HMT11 is not tight enough. Thus, since X is rank £, in practice we even observe that ||X — X|| F < ||X — X r || F , 
especially for small r, as shown in Figure [3] The figure also shows that using q > power iterations is extremely 
helpful, though this is not taken into account in our analysis since there are no useful theoretical bounds (in the 
Frobenius norm). Note that variants for eigenvalues also exist; we refer to the equivalent of RandomizedSVD as 



RandomizedEIG, which has the property that U — V and E need not be positive (cf., [HMT11 GM13| ) 



3 Algorithm 

3.1 Projected gradient descent 

Our approach is based on the projected gradient descent algorithm: 

X m = V e r (X i+1 - MiV/(Xi)), (4) 

where Xj is the i-th iterate, V/(-) is the gradient of the loss function, fii is a step-size, and V^(-) is the approximate 
projector onto rank r matrices given by RandomizedSVD. If we include a convex constraint C, then the iteration is 

Xi+i = V c {V^i+i - W V/(X<))). (5) 

In practice, Nesterov acceleration improves performance: 

Y i+1 = (l + ft)X 4 -AX^i (6) 
X i+1 = V(Yi - W V/(Yi)), (7) 



where Pi is chosen /3j = (aj_i — and a = 1, 2a i+ i = 1 + y/ 4o| + 1 Nes83 (see |KC12 ). Theorem [i] holds 



for a stepsize /ij based on the RIP constant, which is unknown. In practice, the algorithm consistently converges 
as long as m < jjpp- 

Algorithm [3] shows implementation details that are important for keeping low-memory requirements. The 
implementation of maps like A and At depends on the structure of .4.; see section [5. 1| for explicit examples. 
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4 Convergence 



We assume the observations are generated by y = .4X* + e where e is a noise term, not to be confused with the 
approximation error e. In the following theorem, we will assume that ||^4|| 2 < mn/p, which is true for the quantum 
tomography example |Liull ; if A is a normalized Gaussian, then this assumption holds in expectation. 

Theorem 2. (Iteration invariant) Pick an accuracy e = ( ^ry ; where p is defined as in Theorem 1. Define t — r + p 
and let c be an integer such that £ — (c — l)r. Let fa — 2 (i+s — ) ^ n @ an< ^ assume 1 1 2 < mn/p and /(Xj) > 
C 2 |je|| 2 ; where C > 4 is a constant. Then the descent scheme Q or ^ has the following iteration invariant 



in expectation, where 



and 



Ef(X t+1 )<9f(X l )+r\\e\\\ (8) 



< 12 .i±W^_.i^ + (1 + e): 3 ^ 



1 — S cr \ 1 + S cr p 1 — 5; 



2r 



r< — 12-(l + e) 1 ' 



1 - 6 cr V V 1 - S: 

The expectation is taken with respect to Gaussian random designs in RandomizedSVD. If 9 < 9^ < 1 for all 
iterations, then limj_> 00 E/(Xi) < max{C 2 , — }||e|| 2 . 

Each call to RandomizedSVD draws a new Gaussian r.v., so the expected value does not depend on previous 
iterations. By Corollary 3.4 in NT09 , 8 cr < c ■ &2r, which allows us to put 9 and r in terms of bi r if desired, at a 
slight expense in sharpness. 

The expected value of the function converges linearly at rate 9 to within a constant of the noise level, and in 
particular, it converges to zero when there is no noise since C and r are finite. Note that convergence of the iterates 
follows from convergence of the function /: 

Corollary 1. ///(X,) < 1 , then ||X, - X*||| < ( ^+g a) ' ■ 

Proof. By the R-RIP and the triangle inequality, 



^l + S 2r (A)\\X l - X*\\ F < \\A(Xi) - A{X*)\\ 2 

= ||(A(X i )-y)-(A(X*)-y)|| 2 
<||(^(X l )-y)|| 2 + ||e|| 2 



□ 



Corollary 2 (Exact computation). If e — and there is no additional convex constraint C, then 9 = -fff^ (1 + ^) 
and t = 1 + j^rfj- , hence 6 < 1 if <5 2r < 3+4/c • 

Corollary [2] shows that without the approximate SVD, the R-RIP constants are quite reasonable. For example, 
with exact computation and no noise, any value of <5 2r < 1/3 implies that lim^oo Xj = X*. With noise, choosing 
C = 4 gives S 2r = 1/5 and 9 = 3/4, r = 3/2 and thus lim^^ /(X*) < max{16, 6}||£|| 2 . 

Note that the theorem gives pessimistic values for e. We want the bound on 9 to be less than 1 in order to have 
a contraction, so we need 

12 • l ± 5 ^ . ' . ™ + 12(1 + e) • 1 ± 6 ^ ■ < 1 

1 — o cr 1 + o cr p 1 — o cr 1 — 2r 

„ ' v . ' 

I II 

For a rough analysis, we will give approximate conditions so that each of the I and II terms is less than 0.5. It is 
clear that the terms blow up if S cr — > 1, so we will assume S cr <ti 1 (and hence S 2r Cl)- Then setting 1 + <5 2r ~ 1 
in the numerator of I, we require that 

12 emn 1 . . 
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which means that we need e < 2imn • ^ or Q uan tum tomography, m — n and p = O(rn), so we require e < 0{r/n). 
From Theorem [I] our bound on e is r/(p — 1), so we require p ~ n, which defeats the purpose of the randomized 
algorithm (in this case, one would just do a dense SVD). Numerical examples in the next section will show that p 
can be nearly a small constant, so the theory is not sharp. 

For the II term, again approximate 1 + S 2r ~ 1 an d then multiply the denominators and ignore the S cr d2 r term 
to get 

72<S cr (l + e) < 1 - S 2r - S cr . (10) 
Since certainly e < 0.5 and 5 2r + <>cr < 0.5, a sufficient condition is 5 cr < 1/216, which is reasonable (cf. |JNS12| ). 



5 Numerical experiments 

5.1 Application: quantum tomography 

As a concrete example, we apply the algorithm to the quantum tomography problem, which is a particular instance of 
0. For details, we refer to GLF + 10|FGLE12 . The salient features are that the variable X 6 C nx ™ is constrained to 



be Hermitian positive-definite, and that, unlike many low-rank recovery problems, the linear operator Ji satisfies the 



R-RIP: Liull establishes that Pauli measurements (which comprise A.) have R-RIP with overwhelming probability 
when p = 0(rn log 6 n). In the ideal case, X* is exactly rank 1, but it may have larger rank due to some (non- 
Gaussian) noise processes, in addition to AWGN e. Furthermore, it is known that the true solution X* has trace 
1, which is also possible to exploit in our algorithmic framework. 

Since X is Hermitian, the u and v terms in the algorithm are identical. Several computations can be simplified 
and there is a version of Algorithm[l]which exploits the positive-definiteness to incorporate a Nystrom approximation 



(and also forces the approximation to be positive-definite); see HMT11 GM13| . Here, we focus on showing how 
the functions A and At can be computed (due to the complex symmetry, At* = At). 

In quantum tomography, the linear operator has the form («4.(X))j = (Ej,X) where Ej = E* is the Kronecker 
product of 2 x 2 Pauli matrices. There are four possible Pauli matrices <J X ^^ Z if we define 07 to be the 2x2 identity 
matrix. For a ^-qubit system, E 3 = <j^\ ® <Tj 2 ® . . . <8> <Jj qb ■ For roughly 12 qubits and fewer, it is simple to calculate 
-4.(X) by explicitly forming Ej and then creating a sparse matrix A with the j th row of A equal to vec(E.,) so that 
«4.(X) = Avec(X). For larger systems, storing this sparse matrix is impractical since there are p > n rows and 
each row has exactly n non-zero entries, so there are over n 2 entries in A. 

To keep memory low, we exploit the Kronecker-product nature of Ej and store it with only qt> numbers. When 
X = xx*, we compute (Ej,X) = tracc(E 3 xx*) = trace(x*Ej-x), and E^x can be computed in O(qbn) time. This 
gives us A. The output of A is real even when X is complex. 

To compute At(z,w) when the dimensions are small, we just explicitly form the matrix M = -4.(z) and then 
multiply Mw. To form M, we use the same sparse matrix A as above and reshape the n 2 vector A*z into anxn 
matrix. For larger dimensions, when it is impractical to store A, we implicitly represent M = Y^j=i z j^j anc ^ thus 
Mw = Y?j=i Z j-E?w. I n general, the output is complex. However, if it is known a priori that X is real-valued, 
this can be exploited by taking the real part of M. This leads to a considerable time savings (2x to 4x), and all 
experiments shown below make this assumption. 

In our numerical implementation, we code both A and At in C and parallelize the code since this is the most 
computationally expensive calculation. Our parallelization implementation uses both pthreads on local cores as 
well as message passing among different computers. There are two approaches to parallelization: divide the indices 
j = 1, . . . ,p among different cores, or, when x or w has several columns, send different columns to the different 
cores. Both approaches are efficient in terms of message passing since A. is parameterized and static. The latter 
approach only works when x or w has a significant number of columns, and so it does not apply to Lanczos methods 
that perform only matrix-vector multiplies. 

Recording error metrics can be costly if not done correctly. Let X = xx* and Y = yy* be rank-r factorizations. 
For the Frobenius norm error ||X — Y||^ which requires n 2 operations naively, we expand the term and use the 
cyclic invariance of trace to get ||X — Y|||, = tracc(x*xx*x) + trace(y*yy*y) — 2 trace( x*yy*x ), which requires 
only 0(nr 2 ) flops. In quantum information, another common metric is the trace distance NC10 ||X — Y||*, where 



is the nuclear norm. This calculation requires flops if calculated directly but can also be calculated 



cheaply via FactoredSVD on U = V = [x, y] and D = [I, 0; 0, —I]. The third common metric is the fidelity NC10 
given by ||X 1 / 2 Y 1 / 2 || >t . If either X or Y is rank-1, this can be calculated cheaply as well. 



G 



5.2 Results 




Iterations 



Figure 1: (Left) Convergence rate as a function of parameters to RandomizedSVD/RandomizedEIG. (Right) Com- 
parison of just eigenvalue computation times via three methods. 




Dimension n Dimension n 

Figure 2: Mean time of 10 iterations: this includes the matrix multiplications as well as eigenvalue computations. 
(Left) shows times for a complete iteration of our method on a single computer using sparse matrix multiplies ("full 
memory") and, above 11 qubits, the custom low-memory implementation as well (not multi-threaded) on the same 
computer. (Right) shows times for just the RandomizedSVD/RandomizedEIG. 

Figure [l] (left) plots convergence and accuracy results for a quantum tomography problem with 8 qubits and 
p = 4rn with r = 1. The SVP algorithm works well on noisy problems but we focus here on a noiseless (and truly 
low-rank) problem in order to examine the effects of approximate SVD/eigenvalue computations. The figure shows 
that the power method with q > 1 is extremely effective even though it lacks theoretical guarantees; without the 
power method, take p ~ 20 and we see convergence, albeit slower. When p is smaller and the R-RIP is not satisfied, 
taking p or q too small can lead to non- convergence. 

Figure [I] (right) is a direct comparison of RandomizedEIG (with p = 5 and q = 3) and the Lanczos method for 
multiplies of the type encountered in the algorithm. The RandomizedEIG has the same asymptotic complexity but 
much better constants. 

Figure [2] shows that because the eigenvalue decomposition is a significant portion of the computational cost, 
using RandomizedEIG instead of Lanczos makes a difference. The difference is not pronounced in the small-scale 
full-memory implementation because the variable X is explicitly formed and matrix multiplies are relatively cheap 
compared to other operations in the code. For larger dimensions with the low-memory code, X is never explicitly 
formed and multiplying with the gradient is quite costly. The randomized method requires fewer multiplies, ex- 
plaining its benefit. For 12 qubits, the Lanczos method averages 98.4 seconds/iteration, whereas the randomized 



7 




5 10 15 20 25 30 



5 10 15 20 25 30 



Figure 3: Top row: e for (left) q — and (right) q — 1 power iterations. Bottom row: e for q = 2 power iterations 
(left), and (right) shows the bound e. 
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Figure 4: The table (left) shows error metrics for the noisy rank-1 16-qubit recovery. The figure (right) shows the 
convergence rate for the same simulation. 



method averages just 59.2 seconds. The right subfigure shows that the low-memory implementation (which has 
memory requirement 0(rn)) still has only 0(n 2 ) time complexity per iteration. 

Figure [3] tests Theorem [I] by plotting the value of 

e= ||X-X|H./||X-X r |||,-l 

(which is bounded by e) for matrices X that are generated by the iterates of the algorithm. The algorithm is set 
for r = 1 (so X is the sum of a rank 2 term, which includes the Nesterov term, and the full rank gradient), but 
the plots consider a range of r and a range of oversampling parameters p. The plots use q — 0, 1 (top row, left to 
right) and q = 2 (bottom row, left) power iterations. Because X has rank I = r + p, it is possible for e < 0, as we 
observe in the plots when r is small and p is large. For two power iterations, the error is excellent. In all cases, the 
observed error 1 is much better than the bound e (shown bottom row, right) from Theorem [Tj suggesting that it 
may be possible to have a more refined analysis. 

Finally, to test scaling to very large data, we compute a 16 qubit state (n — 65536), using a known quantum state 
as input, with realistic quantum mechanical perturbations (global depolarizing noise of level 7 = 0.01; see |FGLE12| ) 



as well as AWGN to give a SNR of 30 dB, and p = 5n = 327680 measurements. The first iteration uses Lanczos 
and all subsequent iterations use RandomizedEIG using p = 5 and q — 3 power iterations. On a cluster with 10 
computers, the mean time per iteration is 401 seconds. The table in Fig. [4] (left) shows the error metrics of the 
recovered matrix, and Fig. [4] (right) plots the convergence rate of the Frobenius-norm error and trace distance. 

Figure [5] reports the median error on 20 test problems across a range of p. Here, X* is only approximately 
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Number of measurements (divided by n*r) 



Figure 5: Accuracy comparison of several algorithms, as a function of number of samples p. Each point is the 
median of the results of 20 simulations. 



low rank and y is contaminated with noise. We compare the convex approach FGLE12 , the "AltMinSense 



approach JNS12 , and a standard splitting approach. AltMinSense and the convex approach have poor accuracy; the 
accuracy of AltMinSense can be improved by incorporating symmetry, but this changes the algorithm fundamentally 
and the theoretical guarantees are lost. The splitting approach, if initialized correctly, is accurate, but lacks 
guarantees. Furthermore, it is slower in practice due to slower convergence, though for some simple problems (i.e., 



no convex constraints C) it is possible to accelerate using L-BFGS Laul2 



6 Conclusion 



Randomization is a powerful tool to accelerate and scale optimization algorithms, and it can be rigorously included 
in algorithms that are robust to small errors. In this paper, we leverage randomized approximations to remove 
memory bottlenecks by merging the two-key steps of most recovery algorithms in afhne rank minimization problems: 
gradient calculation and low-rank projection. Unfortunately, the current black-box approximation guarantees, such 
as Theorem [T] are too pessimistic to be directly used in theoretical characterizations of our approach. For future 
work, motivated by the overwhelming empirical evidence of the good performance of our approach, we plan to 
directly analyze the impact of randomization in characterizing the algorithmic performance. 
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A Proofs 



Proof of Theorem [1| There are three aspects to the proof. Even without approximate SVD calculations, the problem 
is non-convex, so we must leverage the R-RIP to prove that iterates converge. Mixed in with this calculation is the 
approximate nature of our rank I point Xj+i, where we will apply the bounds from Theorem [l] Finally, we relate 
Xj_|_i to its rank r version Xj+i. 

An important definition for our subsequent developments is the following: 
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Definition 2 (e-approximate low-rank projection). Let X be an arbitrary matrix. For any e > 0, V*, «(X) provides 
a rank-i' matrix approximation to X such that 

E||^,(X)-X||* < (l + e)||7V(X)-X||^ (11) 
where V r '(X) 6 argmin Y:ran k(Y)<r' 

NX — Y|| F . 

Let Xj be the putative rank r solution at the i-th iteration, X* be the rank r matrix we are looking for and X^+i 
be the rank I matrix, obtained using approximate SVD calculations. Define L := 2(1 + 5 r+ e) and M := 2(1 — ^2r)- 
Then, we have: 

/(Xi+i) = /(X,) + (V/(X. ( ), X i+1 - Xi) + ||.A(X i+1 - X0|| 2 F 
< /(X^ + (V/fXi), Xj+i - X,) + |||X i+1 - XiHl 

= /(Xi) - ^||V/(X 4 )|| 2 F + I (j|X t+1 - XiHl + 2<lv/(X i ) ! X i+1 - X 4 ) + ^||V/(Xi)|| F 
= /(Xi) - ^||V/(X 4 )|| 2 F + |||X i+1 - (X s: - ^-V/(X,)) || 2 F . (12) 
By construction Xj + i €V* t (Xj — ^V/(Xj)) (since the step-size is fx = 1/L), so, for X+i G £r (Xi — |;V/(Xj)), 
E||X i+1 - (X, - iv/(Xi))||| < (1 + e)||X +1 - (X, - Iv/(X t ))|| 2 F 

<(l + e )||X*-(X l -iv/(X. i ))|| F (13) 



by the definition of V r (-) (since rank(X*) = r). Combining (13) with ( |T2| ), we obtain: 
E/(X i+1 ) < /(Xi) - ^||V/(Xi)||| + |(1 + e)||X* - Xi + iv/(Xi)||| 

= /(X,) - i||V/(Xi)|| 2 F + (1 + e) (^||V/(Xi)|| F + (V/(Xi), X* - Xi) + |||X* - X t \\ F 



<(l + e) 



/(Xi) + (V/(Xi), X* - X,) + |||X* - X t \\% 



^l|V/(X,)|| F (14) 



where we use the fact that /(Xj) > in the last inequality Due to the restricted strong convexity of / that follows 
from the restricted isometry property, we have: 

/(X*) > /(Xi) + (V/(Xi),X* - X,) + y ||X* - X,|| 2 F 
/(X*) - y||X* - XJ 2 , > /(Xi) + (V/(Xi),X* - Xi) 



which, combined with ( |14| , leads to: 

E/(X i+1 ) < (1 + e) 

Due to the R-RIP, 



/(X*) + - — — ||X*-X 



2 

i\\F 



( HV/(X j; )|| F (15) 



2i 



, <M t3 (16) 

1 - 02r 

Now define a constant C and assume /(X^) = ||y — ^4X^ ||| > C 2 ||£||2 (if the assumption fails, it means X^ is already 
close to X*). In particular, in the noiseless case ||£|| = 0, we may pick C arbitrarily large and set all 1/C terms to 
zero. 

\\A{X*-X i )f F = \\y-A(X i )-ef 2 

= ||y - A(Xi)|| 2 + || £ || 2 - 2<e,y - A(X t )) 
<f(X i ) + \\ef 2 + 2\\e\\ 2 \\y-A(X i )\\ 2 

< /(Xi) + \\ef 2 + |/(Xi) (17) 
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Substituting (17) and (16) into (15), expanding the values of L and M, and noting that /(X*) = ||y — «4.(X*)| 



||e||f, gives 



E/(X i+1 )<(l + e) 
<(l + e) 



$2r 



1 - hr 



12 + ^/(Xi 



2L 



iv/cx^m 



<5 r+ ^ + i5 2r 
1-5. 



2r 



l + ^J/Cx,) 



2r 



2L 



|V/(Xi 



We bound ||V/(Xj)|| using our assumption on the magnitude of ||.A||: 



|V/(Xi)||^ = A\\A* (y - AQQ) || F < 4\\A*\\ J \\y - ApUM = 4||^||V(Xi) < 4— /(X, 



For quantum tomography, we even have AA* 



-I, so the inequality holds with equality (and m = n). 



Combining (19) with (20) and by the definition of L, we obtain: 

5 r+ e + 52r 



E/(X i+ i) < (1 + e) 



5r+e + $2r 



1 - 5 



■li- 
mn 



1 



1 + 5 r+e p 



+ + 



r . , /(Xi) 



1 



1 - <5 2 r 



kill 



1 - 5 2r 



1 



c 



/(X<) + (l + e) 1 + 



1 + <5 r+ £ p 



/(X<) 



1-* 



2c 



Mil 



*M|2 
2 



(18) 
(19) 

(20) 



(21) 
(22) 



Note that if an exact SVD computation is used, then not only is e = but also Xj+i is rank r, so we are done 
and can use 6 = 9' and r = t' . To finish the proof, we now relate E/(Xj+i) to E/(Xj+i). In the algorithm, Xj+i 
is the output of RandomizedSVD, and Xj+i is the intermediate value WEV* on line 10 of Algo.[lj Given Xj+i with 
rank(Xi + i) = £ > r, Xj+i is defined as the best rank-r approximation to Xi + i0 Thus, the following inequality 
holds true: 



|Xi+i — X*||f — ||Xj+i — Xi+i + Xj + i — X 

— ||Xj+i — Xj+i II f + |Xj + i — X*||_f 
<2||X m -X*|| F 



(23) 



since ||Xj+i — Xj_|_i|| F < ||X* — Xj_)_i||j?. In particular, since the above is valid for any value of the random variable 
Xj+i, E ||X,; + i — X*|| F < E 4||Xj+i — X*|| F . This bound is pessimistic and in practice the constant is close to 1 
rather than 4. 

We will again assume that /(Xj+i), /(X i+ i) > C 2 ||e|| 2 , and C > 2, since otherwise the current point is a 
good-enough solution. We have: 



/(X i+1 ) = ||y - A(K i+1 )f 2 = \\A(X* - X i+1 ) + eg 

-||A(X*-X l+ |J 



= || A(x* - x i+1 ; 
= ||A(x*-x i+1 ; 

< ||A(x*-x i+1 ; 

< \\A(x.* - x i+1 ; 



41 
41 



2(A(X*-X i+1 ) )£ ) 
2{y-A(X i+1 )-e,e) 



e\\j + 2(y - Apt i+ i),e) + 2(-e,e) 
e||^+2||y-A(X i+1 )|| 2 ||e|| a -2||e|| 



C 



/(X i+ i) 



which, ifl — 2/C>0, implies 



/(X 



< 



1-2/C 



|.A(X* -X 



i + i;il2 



1-2/C" 



(24) 



4 If we include a convex constraint C then instead of defining Xj+i = P r (Xi+i) we have Xj+i = Vc^Pr{^-i+\))- In this case, 
||7> c (7V(X i+1 )) - X*\\ F = \\V c (V r (± l+1 ) - X*)\\ F < ||P r (Xi+i) - X*|| F . 

The first equality follows from X* £ C and the second is true since the projection onto a non-empty closed convex set is non-expansive. 
Hence the result in (123} still applies when we include the C constraints. 
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By the R-RIP assumption, we have: 



\A(X* - X i+1 )||l < (1 + <yar)||X* - X l+1 |||. 



Using (|23j) and (|25j in (|24j), we obtain: 

/(X l+1 ) < lii±^l ||X l+1 - xi* - j^lkll? 

Using the R-RIP property again, the following sequence of inequalities holds: 



(25) 



(26) 



X i+1 — X 



*l|2 



< 



|A(X 



i+1 



'*M|2 



< 



1 — S r +e 
1 + 2/C 



1-5., 



/(Xj_) 



1 



1-5. 



(27) 



where the second inequality is obtained following same motions as \Y1\ - Combining ( 26 >- ( 27 1 with (22), we obtain: 
4(l + 5 2r ) 1 + 2/C al 



E/(X j; _ 



< 



1 - 2/C 1 - 5 r 



0' -/(Xi) 



4(1 + *ar) 1 + 2/C , , 4(l + <5 2r ) 

T 



1 



1 



1-2/C l-£ 



r+t 



1 - 2/C 1 - 5 r+f 1 - 2/C 



Now we simplify the result to make it more interpret able. Define p = £ — r. Let c be the smallest integer such 
that £ > (c — l)r (and for simplicity, assume £ = (c — l)r) so that i5 r +£ = 5 cr and <5 r +£ + (52 r < 28 cr . By Theorem[TJ 

e < 



— | c -2 r )r-i ■ For concreteness, take C > 4 so that 1 + 2/C < 3/2 and (1 - 2/C)" 1 < 2. Then 

1 + S 2r 



< 12 • 



l-S r 



and 



r < 12 



< 



l+jg 
1 + <5 2r 

1 - 5nr 



e mn , H . 3(5 cr 
i ? + 1 + e )i T- 



8(1 + 5 2r ) 



(28) 



+ 1 + 



12- (1 + e) 1 + 



1 - hr 
2S cr 



1 - S r 



1 - 5. 



2r 



(29) 

□ 
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